AITopics | training time

Collaborating Authors

training time

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Solver-Free Decision-Focused Learning for Linear Optimization Problems

Neural Information Processing SystemsJun-23-2026, 00:52:38 GMT

Mathematical optimization is a fundamental tool for decision-making in a wide range of applications. However, in many real-world scenarios, the parameters of the optimization problem are not known a priori and must be predicted from contextual features. This gives rise to predict-then-optimize problems, where a machine learning model predicts problem parameters that are then used to make decisions via optimization. A growing body of work on decision-focused learning (DFL) addresses this setting by training models specifically to produce predictions that maximize downstream decision quality, rather than accuracy. While effective, DFL is computationally expensive, because it requires solving the optimization problem with the predicted parameters at each loss evaluation.

adjacent vertex, artificial intelligence, machine learning, (15 more...)

Neural Information Processing Systems

Genre:

Overview (0.93)
Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training

Neural Information Processing SystemsJun-22-2026, 19:56:15 GMT

Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time τgen at which models begin to generate high-quality samples, and a later time τmem beyond which memorization emerges. Crucially, we find that τmem increases linearly with the training set size n, while τgen remains constant. This creates a growing window of training times with n where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when nbecomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.

artificial intelligence, machine learning, memorization, (19 more...)

Neural Information Processing Systems

Country: Europe > France (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training

Neural Information Processing SystemsJun-22-2026, 19:45:02 GMT

As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block's MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing the training time than the baseline. Codes are available at: https://casl-ku.github.io/FAL/

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AFaster Training Algorithm for Regression Trees with Linear Leaves, and an Analysis of its Complexity

Neural Information Processing SystemsJun-22-2026, 14:39:59 GMT

We consider the Tree Alternating Optimization (TAO) algorithm to train regression trees with linear predictors in the leaves. Unlike the traditional, greedy recursive partitioning algorithms such as CART, TAO guarantees a monotonic decrease of the objective function and results in smaller trees of much better accuracy. We modify the TAO algorithm so that it produces exactly the same result but is much faster, particularly for high input dimensionality or deep trees. The idea is based on the fact that, at each iteration of TAO, each leaf receives only a subset of the training instances. Thus, the optimization of the leaf model can be done exactly but faster by using the Sherman-Morrison-Woodbury formula. This has the unexpected advantage that, once a tree exceeds a critical depth, then making it deeper makes it faster to train, even though the tree is larger and has more parameters. Indeed, this can make learning a nonlinear model (the tree) asymptotically faster than a regular linear regression model. We analyze the corresponding computational complexity and verify the speedups experimentally in various datasets. The argument can be applied to other types of trees, whenever the optimization of a node can be computed in superlinear time of the number of instances.

algorithm, artificial intelligence, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (1.00)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Deep Legendre Transform

Neural Information Processing SystemsJun-20-2026, 08:41:07 GMT

We introduce a novel deep learning algorithm for computing convex conjugates of differentiable convex functions, a fundamental operation in convex analysis with various applications in different fields such as optimization, control theory, physics and economics. While traditional numerical methods suffer from the curse of dimensionality and become computationally intractable in high dimensions, more recent neural network-based approaches scale better, but have mostly been studied with the aim of solving optimal transport problems and require the solution of complicated optimization or max-min problems. Using an implicit Fenchel formulation of convex conjugation, our approach facilitates an efficient gradient-based framework for the minimization of approximation errors and, as a byproduct, also provides a posteriori estimates of the approximation accuracy. Numerical experiments demonstrate our method's ability to deliver accurate results across different high-dimensional examples. Moreover, by employing symbolic regression with Kolmogorov-Arnold networks, it is able to obtain the exact convex conjugates of specific convex functions.

artificial intelligence, legendre transform, machine learning, (17 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

You Only Once Taking a Spectral Detour to Accelerate Graph Neural Network

Neural Information Processing SystemsJun-19-2026, 08:02:13 GMT

Training Graph Neural Networks (GNNs) often relies on repeated, irregular, and expensive message-passing operations over all nodes (e.g., N), leading to high computational overhead. To alleviate this inefficiency, we revisit the GNNs training from the spectral perspective.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

59ea33ae3d096f3bcd5026b479710cf8-Paper-Conference.pdf

Neural Information Processing SystemsJun-17-2026, 12:24:21 GMT

Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resourceintensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Education (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Neural Information Processing SystemsJun-16-2026, 18:56:17 GMT

Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > Mexico (0.28)
Asia > China (0.28)
North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

Dense Attention Latency: 1649s Radial Attention (Ours) Latency: 876s (1.9 Faster) PSNR: 27.3 (a) 117 Frames (Default Length)

Neural Information Processing SystemsJun-15-2026, 05:25:22 GMT

Recent advances in diffusion models have enabled high-quality video generation, b making ut the additional training and temporal inference dimension on long significantly videos prohibiti increases vely computational expensive. In costs, this paper diffusion, we models: identify post-softmax a phenomenon attention we term scores Spatiotempor diminish al as Ener spatial gy Decay and temporal in video distance o scalable ver space sparse between and time attention tok in ens nature.

artificial intelligence, machine learning, speedup, (17 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Media (0.67)
Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

Neural Information Processing SystemsJun-14-2026, 18:47:06 GMT

Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions.

confounder, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country: